Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches
Abstract
Feature embedding methods have been proposed in the literature to represent sequences as numeric vectors to be used in some bioinformatics investigations, such as family classification and protein structure prediction. Recent theoretical results showed that the well-known Lyndon factorization preserves common factors in overlapping strings [1]. Surprisingly, the fingerprint of a sequencing read, which is the sequence of lengths of consecutive factors in variants of the Lyndon factorization of the read, is effective in capturing sequence similarities, suggesting it as a basis for the definition of novel representations of sequencing reads. We propose a novel feature embedding method for Next-Generation Sequencing (NGS) data using the notion of fingerprint. We provide an experimental framework to estimate the behaviour of fingerprints and of the k-mers extracted from them, called k-fingers, as possible feature embeddings for sequencing reads. As a case study to assess the effectiveness of such embeddings, we use fingerprints to represent RNA-Seq reads in order to assign them to the most likely gene from which they originated as fragments of transcripts of the gene. We provide an implementation of the proposed method in the tool lyn2vec, which produces Lyndon-based feature embeddings of sequencing reads.
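To make the notions of fingerprint and k-fingers concrete, the following is a minimal sketch that assumes the standard Lyndon factorization computed with Duval's algorithm. The abstract refers to variants of the factorization, which are not reproduced here, and this is not the lyn2vec implementation. The function names `lyndon_factorization`, `fingerprint`, and `k_fingers` are hypothetical and chosen only for illustration.

```python
def lyndon_factorization(s):
    """Standard Lyndon factorization of s via Duval's algorithm.

    Returns the unique list of Lyndon words, in lexicographically
    non-increasing order, whose concatenation is s.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        j, k = i + 1, i
        # Extend the current run while it remains a prefix of a
        # power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1
            j += 1
        # Emit every complete copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


def fingerprint(read):
    """Sequence of lengths of consecutive Lyndon factors of the read."""
    return [len(factor) for factor in lyndon_factorization(read)]


def k_fingers(fp, k):
    """All windows of k consecutive lengths taken from a fingerprint."""
    return [tuple(fp[i:i + k]) for i in range(len(fp) - k + 1)]


if __name__ == "__main__":
    read = "GATTACA"  # toy read, assumed for illustration only
    fp = fingerprint(read)
    print(lyndon_factorization(read))  # ['G', 'ATT', 'AC', 'A']
    print(fp)                          # [1, 3, 2, 1]
    print(k_fingers(fp, 2))            # [(1, 3), (3, 2), (2, 1)]
```

In this sketch a read is reduced to a short list of integers, and each k-finger is a tuple that could serve as a token for a downstream embedding or classifier. How lyn2vec actually encodes these values is not specified in the abstract.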
Similar resources
Feature Selection for Discrete and Numeric Class Machine Learning
Algorithms for feature selection fall into two broad categories: wrappers use the learning algorithm itself to evaluate the usefulness of features, while filters evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most existing filter al...
Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning
Algorithms for feature selection fall into two broad categories: wrappers that use the learning algorithm itself to evaluate the usefulness of features and filters that evaluate features according to heuristics based on general characteristics of the data. For application to large databases, filters have proven to be more practical than wrappers because they are much faster. However, most exist...
Correlation-based Feature Selection for Machine Learning
A central problem in machine learning is identifying a representative set of features from which to construct a classification model for a particular task. This thesis addresses the problem of feature selection for machine learning through a correlation-based approach. The central hypothesis is that good feature sets contain features that are highly correlated with the class, yet uncorrelated w...
Time series forecasting of Bitcoin price based on ARIMA and machine learning approaches
Bitcoin, as the current leader in cryptocurrencies, is a new asset class receiving significant attention in the financial and investment community and presents an interesting time series prediction problem. In this paper, some forecasting models based on classical approaches like ARIMA and machine learning approaches including Kriging, Artificial Neural Network (ANN), Bayesian method, Support Vector Machine...
Machine Learning Approaches to Link-Based Clustering
We have reviewed several state-of-the-art machine learning approaches to different types of link-based clustering in this chapter. Specifically, we have presented the spectral clustering for heterogeneous relational data, the symmetric convex coding for homogeneous relational data, the citation model for clustering the special but popular homogeneous relational data – the textual documents with...
Journal
Journal title: Information Sciences
Year: 2022
ISSN: 0020-0255, 1872-6291
DOI: https://doi.org/10.1016/j.ins.2022.06.005